Incorporating Nuisance Parameters in Likelihoods for Multisource 
Spectra 



J. S. Conway 

University of California, Davis, USA 
Abstract 

We describe here the general mathematical approach to constructing likeli- 
hoods for fitting observed spectra in one or more dimensions with multiple 
sources, including the effects of systematic uncertainties represented as nui- 
sance parameters, when the likelihood is to be maximized with respect to 
these parameters. We consider three types of nuisance parameters: simple 
multiplicative factors, source spectra "morphing" parameters, and parameters 
representing statistical uncertainties in the predicted source spectra. 

1 Overview 

In particle physics one often encounters the general problem of estimating physical parameters such as 
particle masses or cross sections from the spectra of observables calculated in each event. In the case 
of a known, well-established signal process, the dominant technique by far is to use a binned likelihood 
assuming a Poisson distribution in each bin [H, and find the parameters which maximize the likelihood. 

In the case of a search for a new particle or effect resulting in either a discovery or null result, 
binned likelihoods have also been employed successfully to quote statistical significance or exclusion 
bounds, respectively. From a certain point of view there is a desirable consistency in utilizing the same 
basic statistical method for searches, discoveries, and measurements. 

A key requirement here, however, is that the likelihood somehow incorporate the effects of all 
systematic uncertainties present in the analysis. In frequentist methods this is almost always achieved by 
generating distributions of many pseudoexperiments, where from one pseudoexperiment to the next the 
values of all parameters are varied within their assumed distributions. In a formal Bayesian treatment, 
the nuisance parameters are removed by marginalization: integrating them out, assuming some prior pdf. 
Both of these approaches are computationally very expensive. 

In measuring parameters using binned Poisson likelihoods, as mentioned above, one simply max- 
imizes the likelihood (in practice one minimizes the negative log of the likelihood) with respect to all m 
free parameters, and then constructs the standard error ellipsoid in m-dimensional space. The fit values 
of the nuisance parameters are typically of no interest, leaving one to interpret the intervals for just the 
parameters of interest in a straightforward way [2 1. 

We define in this paper three main types of nuisance parameters representing systematic uncertain- 
ties on the source distributions, and describe how to incorporate them into a binned Poisson likelihood. 

We further argue in this paper that this maximum likelihood method, also called the profile likeli- 
hood, can be applied to searches and discoveries as well, either by a pseudo-Bayesian interpretation of 
the profile likelihood as representing a posterior density in the parameter(s) of interest, or by likelihood 
ratio methods. The profile likelihood requires significantly less computer time, often a much as two 
orders of magnitude less, than frequentist or frequentist- inspired methods such as CLg Ii3j|. That in turn 
allows much more detailed study of the properties of the fit results. 

2 Core of the Poisson Likelihood 

Suppose we observe in a set of N events an observable or in general a set of observables x. If we define 
a set of ni)in bins (which can be of literally any shape we choose) in the space of the observables, then 



the number of events in each bin i, is assumed to be Poisson-distributed according to 

V{ni\^^i) = f^^^-^ (1) 

rii' 

where /i j is the number of expected events in the bin. Typically we can write 

f^source 

/Xj = ^ Lajeji (2) 

for integrated luminosity L, cross section aj for source j, and efficiency eji for source j in bin i, often 
obtained from MC simulation of the process. The sources here include the signal process of interest 
and all background processes. Again, since we are dealing with a possibly multidimensional space of 
observables, the index i can actually label the bins in multiple dimensions. 

The Poisson likelihood for the full observed spectrum is simply the product of the Poisson proba- 
bilities: 

N 

jC = l[r{ni\iii) . (3) 

1=1 

In the absence of any systematic uncertainties one can simply minimize — In £ with respect to all un- 
known parameters in the problem and interpret the resulting standard error ellipsoid in the normal way 
to obtain estimates of the unknown parameters and associated confidence intervals. 



3 Multiplicative Uncertainties 

Multiphcative uncertainties provide the simplest example of systematic uncertainties that can be repre- 
sented by nuisance parameters in profile likelihoods. As an example, let us assume that the integrated 
luminosity is measured in some auxiUary study, and results in a 2% uncertainty. We would rewrite the 
likelihood as 

TV 

£ = '[[V{ni\^ii)giL\L,aL) (4) 

1=1 

for the measured value L ± ctl. The function ^ is a normalized Gaussian of mean L and width aL, which 
serves to constrain the value of the new nuisance parameter L to its measured value. Note that it is L and 
not L that is used to calculate the Hi. The negative log UkeUhood is thus 

(L — 

-lnC = ^ [-Hi In jii + iii] H — (5) 

i ^ 

and thus the remnant of the Gaussian term can be regarded as a penalty on the negative log likelihood. 
It is in principle possible to use functions other than Gaussians to constrain the values of the nuisance 
parameters. In Bayesian terms the constraint functions are simply the prior probability densities of the 
nuisance parameters. 

Any multiphcative uncertainty can be represented in the UkeUhood this way, including uncertain- 
ties on cross sections, overall efficiencies, and the like. One can also introduce multiphcative nuisance 
parameters into Eq. 2 as needed, for any or all sources. 

In many cases, however, the allowed physical bound on a multiplicative nuisance parameter is that 
it remain positive. If we are representing the constraint by a Gaussian, then when the uncertainty in the 
nuisance parameter is large the Gaussian is truncated and an appropriate normahzation factor should be 
included. In such cases one might also consider constraining the parameter with a log normal or other 
probability density which does not allow the parameter to vbecome negative. 
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4 Shape Uncertainties and Morphing 

Many systematic uncertainties result in an overall distortion in the shape of the observed spectrum. 
A good example is an energy scale uncertainty which affects all jet energies in an event in the same 
direction. If there are energy thresholds in the event selection, changes in not only the shape but the 
overall normalization of the efficiency (represented here by the eji for source j in bin i) as a function of 
the observables can result. 

Such spectral distortions can be modeled by altering parameters (like the energy scale) in the 
MC simulation and recalculating the "shifted" set of efficiencies. If we were, for example, to raise and 
lower the energy scale by one standard deviation, recalculating the efficiencies, we would then have three 
measures of the shape (and normalization) of the bin efficiency distribution, which we can call ej^, e^^, 
and ej-. Clearly one can obtain more measures from other alterations of the energy scale, though this can 
often be computationally very expensive. 

We then face the question of how to turn our three measures of the spectral shape into a continuous 
estimate in each bin as a function of the energy scale factor. To do this we introduce a "morphing" 
parameter which we will call /, and which is nominally zero (in the case of no scale shift), and which 
has some uncertainty (usually Gaussian) o"j=l. 

In this general technique, usually called "vertical morphing", we might write the efficiency in a 
bin as a function of the morphing parameter as 

e+ - e" 
^ji = <^ii + / 2 ■ 

In this expression we see that we are treating the difference in the shifted efficiencies in the bin as if they 
represent a measurement of the first-order Taylor expansion around the nominal value. This may or may 
not be a reliable indicator of how the efficiency spectrum changes with energy scale. Also note that for 
/ = ±1 the above expression does not actually yield e^! 

To provide a better estimate of the true behavior of the spectral distortions we have introduced a 
technique whereby we interpolate quadratically for [/[ < 1 and extrapolate linearly beyond that range. 
This does result in the exact measured behavior of the spectrum at / = ±1 but avoids large deviations 
from linear behavior outside the range. The value of the efficiency at any | / 1 < 1 can be determined by 
Lagrange interpolation: 

= ^^^^7. -(/-!)(/ + 1)4 + ^^^4. (7) 

Calculation of the linear extrapolation beyond this range is a straightforward exercise for the reader. 

Clearly if a more accurate representation of the morphing behavior is required, one can, at the 
expense of computation and bookkeeping time obtain additional shifted efficiency spectra and interpolate 
using a higher order polynomial. A good measure of whether this is a worthwhile exercise is to examine 
the behavior of one's morphing parameters as a function of the parameter of interest; if they tend to go 
far from the sampled region (corresponding to one standard deviation in the uncertainty) then it may be 
desirable to obtain more measurements there, and parametrize the measured region with a higher order 
polynomial. 

We also note that there are somewhat more sophisticated methods such as Alex Read's "horizontal 
morphing" lH method. These are more computationally intensive, but could be advantageous. However 
they are not straightforwardly defined in more than one dimension. 

The morphing method presented here can be extended to several morphing parameters for different 
systematic effects simply by adding linearly the deviations from the nominal efficiency due to each effect. 
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5 Statistical Uncertainties in Efficiencies 

Typically one estimates the efficiency of each source in each bin using a Monte Carlo simulation, and 
hence the statistical accuracy of the estimate of the efficiency in each bin depends on the number of MC 
events falling there. Likewise, in other, possibly data-driven methods for estimating the expected number 
of events from some source in some bin, there may be some known statistical uncertainty in each bin. 

Barlow and Beeston f5l proposed a method for representing such systematic uncertainties wherein 
one introduces a separate nuisance parameter multiplying the expected number of events from each 
source in each bin. Nominally the value of these parameters is 1, and one can then constrain the parame- 
ters, which we call /3jj, according to the prior pdf assumed for the number of MC events in the efficiency 
calculation. Barlow and Beeston assumed a Poisson distribution (though one might argue a binomial is 
the most correct form to assume); other choices such as log normal avoid the parameters possibly tending 
to zero. 

Though this method introduces a very large number of new free parameters in the likelihood, 
the problem can be seen to be tractable in the profile likelihood case since the values of the Pji which 
maximize the likelihood within a bin can be found independently of those in all the other bins. 

Assuming a Gaussian constraint on the /3jj, we can write the contribution to the negative log 
likeUhood in a particular bin as 

- InCi = -rii ln(^ f3jifiji) + ^ f3jiHji + ^ • 

3 3 3 

This contribution can be extremized with respect to the fiji by setting the derivative with respect to each 
to zero. Dropping the bin index i for clarity we write 



g(-ln£) _ 



1 



- 1 

+ ^^^ = . (9) 



We thus arrive at a set of nonlinear equations for the /3j in a bin. These can be approximately solved by 
iterative Newton-type methods, or by more sophisticated methods. 

In the context of performing the profile likelihood using MINUIT minimization, one can im- 
plement this Barlow-Beeston type method by solving for the Pji within the "objective" function which 
provides to MINUIT the value of — In L given the values of all the parameters in the fit, and include the 
contribution of the deviations of the 13 ji from unity to — In L. 

However, a problem arises in this approach. Any minimization algorithm can only approximate 
the values of the parameters and, hence, the true minimum of a function. There is always some last step 
which meets the convergence criterion, and somewhere in the space of the input /ijj to the minimization 
for the Pji, one will find the place where that last step is not taken. Near such points the values of 
the resulting fiji and their associated contribution to — In L undergo a small discontinuous jump. Such 
jumps can (and do) dramatically confuse MINUIT's MIGRAD minimizer, which attempts to measure 
the Hessian matrix by finite differences. These jumps cause the resulting parameter covariance matrix 
to become non-positive-definite. When MINUIT detects such a situation it attempts to circumvent it 
by adding to the offending diagonal element of the matrix an amount necessary to restore positive- 
definiteness. Sometimes this works but in many cases all is lost: MINUIT is now dealing with a false 
measure of the Hessian matrix and it tends to send the free parameters in the fit to wild values. We have 
found no solution to this behavior short of rewriting MINUIT. 

The full-blown Barlow-Beeston method for dealing with bin statistical uncertainties is not ab- 
solutely required to represent them properly in the likelihood. What matters, in a bin, is the overall 
statistical uncertainty of the predicted number of events from all sources. The statistical uncertainties for 
each source in each bin are independent, and can be readily combined, particularly if they are Gaussian 
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or Poisson in nature. Thus, a single Barlow-Beeston type parameter is sufficient to represent the statisti- 
cal uncertainty. The value of this parameter, and its contribution to — In L, can be calculated exactly by 
solving a quadratic equation. Using a simplified notation for a single bin, we write 

(0 - I? 

-lnC = -nln0li + Pli+ ^^ / (10) 

where here ^ is the total number of expected events in the bin, given the values of all the other parameters, 
and afs is the relative (statistical) uncertainty in the prediction. Setting the derivative to zero we find the 
quadratic equation 

13^ + (pa}-l)f3 ~nal = (11) 

which can be solved readily and the correct root taken. The extension to other constraint functions is 
straightforward though may result in transcendental equations to solve. 

6 Practical Considerations 

Care must be taken in using the approach described in this paper to avoid a number of potential pitfalls 
which we discuss here. 

Sparsely Populated Bins 

In multi-bin spectra (particularly multi-dimensional spectra) one can encounter situations where 
the number of events per bin varies by orders of magnitude. This can sometimes lead to situations where 

- there can be regions of zero-content bins, surrounded by bins populated by single MC events; 

- such single MC -event-bins can migrate under the influence of the morphing systematic effects, 
spoiling the vertical morphing method; 

- single data events can appear in bins where there is no predicted rate. 

All of these situations must be avoided. The most straightforward is to generate sufficient Monte Carlo in 
all bins, but this may not be practical or even possible. The best alternative is to combine bins according 
to some algorithm (which does not use the observed data distribution!) which ensures some minimum 
statistical threshold in every bin in the fit. 

Bins Entering/Leaving the Likelihood 

It is also necessary to ensure that no bin enters or leaves the Ukehhood as the parameters change. 
It is not impossible for MINUIT to drive parameters to regions where the contribution from a source, or 
even all sources, vanishes in a bin. For example, when studying the profile likelihood as a function of 
some new particle signal, also, one in general wants to evaluate the likelihood for the case of zero signal. 
But if there are bins populated by signal only, this can cause the contribution to go to zero, the logarithm 
of which is of course — oo. 

Simply excluding bins from the likelihood when there are no expected events is not a sufficient 
solution to this problem, as a moment's reflection will make clear. To avoid bins entering/leaving the fit, 
therefore, the bins to be used or not used must be established a priori by finding all bins where some 
contribution is expected, and making sure there are no bins with data but no expected contribution. Once 
determined, this set must remain fixed for the duration of the calculation. 

One way to ensure that no bin leaves the calculation is to always have it contribute at least some 
tiny amount. For example to circumvent the zero-signal issue, we always ensure that the signal cross 
section is no less than 10~^° pb, and that no source in any bin used in the fit ever contributes less 
than 10~^° expected events. Though this is a somewhat inelegant solution to a nevertheless important 
problem, we note that our final results do not depend on these minimum values in practice. 
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7 Pseudo-Bayesian Posterior Densities 

For measuring physical parameters, the profile likelihood can be directly interpreted using the usual 
A(In L) approach to derive confidence intervals in multi-dimensional parameter space. 

To extend this treatment to setting exclusion bounds on parameters such as a hypothetical new 
particle's cross section cjx, we can simply derive a posterior density by treating the profile likeUhood, 
which we shall denote Cmax as one would any likelihood using Bayes' Theorem: 

T^{(^x) = 7007 — t:^. — ri — (12) 

where here V{ax) is the assumed prior pdf in ax-^ 

But does the profile method really result in a posterior density that can be interpreted in this way? 
The most proper Bayesian treatment would not maximize the likelihood with respect to the parameters 
not of interest, but marginalize instead, resulting in what we might denote as C{ax) to highlight the fact 
that the marginalized likelihood is in a sense averaged over the prior-weighted values of the nuisance 
parameters. 

We have performed both calculations, profiling and marginalization, in a variety of complex spec- 
trum fits, and it is our experience that the posterior density derived either way is nearly identical, though 
the marginalized one takes orders of magnitude more compute time. Due to this practical consideration 
alone we employ the profile method and consider it to be a near-perfect representation of a full and proper 
Bayesian marginalization treatment. 



8 Conclusions 

We present in this paper the basic mathematical and numerical approach to fitting multi-source spectra 
using a profile likelihood in which various types of systematic uncertainties are incorporated by rep- 
resenting them by nuisance parameters. This method, we believe, offers a unified approach to setting 
exclusion bounds, making discoveries, and ultimately performing measurements on a wide range of par- 
ticle physics data analyses. 
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'All the usual inveighments against improper priors apply at this point. We would like to point out, however, that in every 
case of which we are aware, where such a posterior is used to quote confidence intervals on the parameter of interest in an 
actual measurement of that parameter, no one ever uses a prior other than a uniform one. 
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